ci: Add automatic flaky test detector #18684
Conversation
node-overhead report 🧳
Note: This is a synthetic benchmark with a minimal express app and does not necessarily reflect the real-world performance impact in an application.
dev-packages/e2e-tests/test-applications/node-express/tests/errors.test.ts
for (const [key, value] of Object.entries(vars)) {
  const pattern = new RegExp(`\\{\\{\\s*env\\.${key}\\s*\\}\\}`, 'g');
  title = title.replace(pattern, value);
  issueBody = issueBody.replace(pattern, value);
String replace special patterns corrupt job name substitution
The replace() calls interpret special dollar-sign sequences in the replacement string ($&, $', $$, $1, etc.) rather than treating them literally. If a job name happens to contain patterns like $& or $1, the issue title and body would contain the matched template placeholder or captured groups instead of the literal job name. While job names rarely contain such patterns, this could cause confusing issue titles when they do.
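A minimal sketch of one possible fix: passing a function as the second argument to `replace()` makes its return value be inserted literally, so dollar-sign sequences in a job name are never interpreted. The `substitute` helper here is hypothetical, wrapping the quoted loop for illustration.

```javascript
// Sketch of a possible fix (hypothetical helper, not the actual workflow code):
// a function replacement is inserted literally, so $-sequences ($&, $1, $$)
// in `value` are not expanded as special replacement patterns.
function substitute(template, vars) {
  let out = template;
  for (const [key, value] of Object.entries(vars)) {
    const pattern = new RegExp(`\\{\\{\\s*env\\.${key}\\s*\\}\\}`, 'g');
    out = out.replace(pattern, () => value);
  }
  return out;
}
```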
I'm not 100% sure on this. The new ticket is unassigned by default and, I think, it doesn't get an SLA set in Linear.
To understand this use case: don't we already get reports within Sentry when there are flaky tests? We could simply adapt the alert: https://sentry.sentry.io/issues/alerts/rules/details/267675/ (which we can also assign to people)
@JPeer264 We only get alerted if there are more than 10 flaky tests detected within an hour, which is a good start but not ideal: the threshold is quite arbitrary, since it depends on how much people are pushing to PRs, for instance. My main goal was to increase the visibility of flaky tests, which I think we get from having them in the issue feed instead of needing to manually check for failing tests and then create issues so somebody can fix them. I am not exactly sure how/when SLAs are assigned, but we can probably get an SLA by applying the correct label? Automatically having someone assigned sounds difficult to me though; maybe flaky CI issues can just be another stream that we look out for during triage. Happy to explore other options as well (e.g. using the alerts we have), I just wanted to put this out to see if we can improve the process around this a bit.
Codecov Results 📊 (generated by Codecov Action)
  per_page: 100
});

const failedJobs = jobs.filter(job => job.conclusion === 'failure');
Timed-out jobs excluded from flaky test detection
Medium Severity
The filter job.conclusion === 'failure' excludes jobs that timed out, which have conclusion === 'timed_out' in the GitHub API. However, the step condition contains(needs.*.result, 'failure') treats timeouts as failures (since needs.*.result maps timeouts to 'failure'). This mismatch means when a test times out on develop, the step runs but finds no matching jobs, causing no issue to be created for potentially flaky tests that intermittently exceed time limits.
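A minimal sketch of the suggested alignment (the `timed_out` conclusion value comes from the GitHub REST API; the helper name is an assumption for illustration):

```javascript
// Treat timeouts like failures, mirroring how `needs.*.result`
// maps a timed-out job to 'failure' in the step condition.
const FLAKY_CONCLUSIONS = new Set(['failure', 'timed_out']);

// Hypothetical helper: keep only jobs whose conclusion counts as a failure.
function selectFailedJobs(jobs) {
  return jobs.filter(job => FLAKY_CONCLUSIONS.has(job.conclusion));
}
```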
This pull request has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you apply the label
Closing due to inactivity after stale warning. Comment or reopen when ready to continue, and use
Semver Impact of This PR
🟢 Patch (bug fixes)

📋 Changelog Preview
This is how your changes will appear in the changelog.

New Features ✨
- Core
- Deps

Bug Fixes 🐛

Internal Changes 🔧

🤖 This preview updates automatically when you update the PR.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Issues created for all failed jobs, not just required ones
- I confirmed the script was including failures from non-required jobs and fixed it by restricting issue creation to failed jobs whose names match the required job set.
Or push these changes by commenting:
@cursor push cd11773317
Preview (cd11773317)
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -1192,7 +1192,28 @@
per_page: 100
});
- const failedJobs = jobs.filter(job => job.conclusion === 'failure');
+ // `listJobsForWorkflowRun` includes non-required jobs as well, so only keep failures from required jobs.
+ const requiredJobNamePatterns = [
+ /^Build$/,
+ /^Browser Unit Tests$/,
+ /^Bun Unit Tests$/,
+ /^Deno Unit Tests$/,
+ /^Node \(\d+\) Unit Tests$/,
+ /^Node \(\d+\)( \(TS .+\))? Integration Tests$/,
+ /^Cloudflare Integration Tests$/,
+ /^Playwright .+ Tests$/,
+ /^PW .+ Tests$/,
+ /^Remix \(Node \d+\) Tests$/,
+ /^E2E .+ Test$/,
+ /^Upload Artifacts$/,
+ /^Lint$/,
+ /^Check file formatting$/,
+ /^Circular Dependency Check$/,
+ /^Size Check$/,
+ ];
+ const failedJobs = jobs.filter(
+ job => job.conclusion === 'failure' && requiredJobNamePatterns.some(pattern => pattern.test(job.name)),
+ );
if (failedJobs.length === 0) {
          console.log('No failed jobs found');

This Bugbot Autofix run was free. To enable autofix for future PRs, go to the Cursor dashboard.
Reviewed by Cursor Bugbot for commit e5db420. Configure here.
  per_page: 100
});

const failedJobs = jobs.filter(job => job.conclusion === 'failure');
Issues created for all failed jobs, not just required ones
Medium Severity
listJobsForWorkflowRun returns every job in the workflow run, not just the ones in the needs list. After filtering for conclusion === 'failure', the script creates "Flaky CI" issues for any failed job, including non-required ones like job_optional_e2e_tests and job_node_overhead_check — both of which run on the develop branch but are intentionally excluded from the needs list of job_required_jobs_passed. When a required job fails (triggering this step) and a non-required job also happens to fail, spurious flaky-test issues would be opened for jobs that aren't expected to always pass.
size-limit report 📦
JPeer264 left a comment
Let's try it. If it doesn't work we just revert or adapt 🥇



Manually checking for flakes and opening issues is a bit annoying, so I was thinking we could add a CI workflow to automate this. The action only runs when merging to develop. It could also run on PRs, but that seems unnecessarily complicated. My thinking is that for a push to develop to happen, all the tests must first have passed in the original PR. Therefore, if a test then fails on develop, we know it's a flake. Open for ideas/improvements/cleanups, or let me know if there are any cases I'm missing that could lead to false positives.
Example issue created with this: #18693
It doesn't get all the details, but I think the most important part is a link to the run so we can investigate further. Also, the logic for creating the issues is a bit ugly, but I'm not sure we can make it cleaner, given that I want to create one issue per failed test rather than dump everything into one issue.
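The one-issue-per-job logic could be sketched as a pure payload builder that is easy to test before the GitHub API call. The function name, label, and body wording below are assumptions for illustration, not the actual workflow code:

```javascript
// Hypothetical sketch: build one issue payload per failed job,
// keeping the workflow-run link as the main pointer for investigation.
function buildFlakyIssuePayloads(failedJobs, runUrl) {
  return failedJobs.map(job => ({
    title: `Flaky CI: ${job.name}`,
    body: [
      `The job \`${job.name}\` failed on \`develop\` after its PR run passed, so it is likely flaky.`,
      '',
      `Workflow run: ${runUrl}`,
    ].join('\n'),
    labels: ['flaky-ci'], // assumed label name
  }));
}
```

Each payload could then be passed to `github.rest.issues.create` inside the github-script step, after checking for an existing open issue with the same title to avoid duplicates.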